Motivation
PPforest, Projection pursuit random forest
Visually exploring a PPforest object
Final comments
16 July, 2018
PPforest is a new supervised classification method based on bagged projection pursuit trees.
This method improves the predictive performance of random forests (RF) when the separation between classes occurs in combinations of variables rather than in individual variables.
PPforest is a black-box model: better tools to open up black-box models provide a better understanding of the data, of the model's strengths and weaknesses, and of how the model will perform on future data.
Ensemble learning methods combine multiple individual models, trained independently, to build a prediction model.
Well-known examples of ensemble learning methods include boosting (Schapire 1990), bagging (Breiman 1996), and random forests (Breiman 2001).
The main differences between ensembles are the type of individual models being combined and the way these individual models are combined.
PPforest is an ensemble learning method, built on bagged trees.
Main concepts:
Bootstrap aggregation (Breiman 1996; Breiman et al. 1996)
Random feature selection (Amit and Geman 1997; Ho 1998) applied to the individual classification trees used for prediction.
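The two ingredients above can be sketched in a few lines. Below is a minimal, self-contained Python illustration of bootstrap aggregation plus random feature selection (a conceptual toy, not the PPforest implementation: PPforest itself is an R package and its base learner is a PPtree, not the nearest-centroid rule invented here for brevity).

```python
import random

def fit_centroids(X, y, feats):
    """Per-class means over the chosen feature subset."""
    cents = {}
    for c in set(y):
        rows = [X[i] for i in range(len(y)) if y[i] == c]
        cents[c] = [sum(r[f] for r in rows) / len(rows) for f in feats]
    return cents

def predict_one(cents, feats, x):
    """Assign x to the class with the nearest centroid (squared distance)."""
    dist = lambda c: sum((x[f] - m) ** 2 for f, m in zip(feats, cents[c]))
    return min(cents, key=dist)

def bagged_ensemble(X, y, n_models=25, n_feats=2, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        feats = rng.sample(range(len(X[0])), n_feats)         # random feature subset
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        models.append((fit_centroids(Xb, yb, feats), feats))
    return models

def predict(models, x):
    """Majority vote over the individual models."""
    votes = [predict_one(cents, feats, x) for cents, feats in models]
    return max(set(votes), key=votes.count)
```

Each model sees a different bootstrap sample and a different feature subset, which decorrelates the individual models; the ensemble then predicts by majority vote.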
The individual classifier in PPforest is a PPtree (Lee et al. 2005).
The splits in PPforest are based on a linear combination of randomly chosen variables. By using linear combinations of variables, the individual model (PPtree) separates classes while taking the correlation between variables into account.
PPtree combines tree-structured methods with projection pursuit dimension reduction. PPtree always treats the data as a two-class problem.
When there are more than two classes, the algorithm uses a two-step projection pursuit optimization at every node split.
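The following Python sketch illustrates why splitting on a linear combination of variables helps: two toy classes are simulated that overlap on each variable individually but differ cleanly in the combination x2 - x1. This is only the intuition behind PPtree's projected splits; the real algorithm chooses the projection by optimizing a projection pursuit index (e.g. an LDA index), which is not reproduced here.

```python
import random

rng = random.Random(42)

# Toy data: the classes overlap completely on x1 and heavily on x2,
# but differ by +/-1 in the combination x2 - x1.
data = []
for _ in range(200):
    x1 = rng.uniform(0, 10)
    label = rng.randrange(2)
    offset = 1.0 if label == 0 else -1.0
    x2 = x1 + offset + rng.gauss(0, 0.2)
    data.append((x1, x2, label))

def best_axis_error(values_labels):
    """Lowest misclassification rate of any single-threshold split."""
    pts = sorted(values_labels)
    n = len(pts)
    best = n
    for cut in range(1, n):
        left = [lab for _, lab in pts[:cut]]
        right = [lab for _, lab in pts[cut:]]
        # try both orientations of the split
        err = min(left.count(1) + right.count(0),
                  left.count(0) + right.count(1))
        best = min(best, err)
    return best / n

err_x1 = best_axis_error([(x1, lab) for x1, x2, lab in data])
err_x2 = best_axis_error([(x2, lab) for x1, x2, lab in data])
err_proj = best_axis_error([(x2 - x1, lab) for x1, x2, lab in data])
# err_x1 and err_x2 stay large; err_proj is essentially zero.
```

An axis-aligned tree (as in a standard random forest) can only approximate this oblique boundary with a staircase of splits, while a single projected split captures it exactly.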
PPforest is on CRAN.
The initial version was developed entirely in R and was not fast enough.
OOB Error rate
Variable importance
Proximity matrix
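The diagnostics above can be illustrated on a toy bagged ensemble. Below is a conceptual Python sketch (not the PPforest internals) computing the OOB error rate and the proximity matrix for a forest of decision stumps (one-split trees); variable importance is omitted for brevity, and the stump learner and data are invented for illustration.

```python
import random

def fit_stump(X, y):
    """Exhaustively pick the (feature, threshold, labels) stump with fewest errors."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for lo, hi in ((0, 1), (1, 0)):
                errs = sum((lo if x[f] < t else hi) != yi for x, yi in zip(X, y))
                if best is None or errs < best[0]:
                    best = (errs, (f, t, lo, hi))
    return best[1]

def stump_predict(stump, x):
    f, t, lo, hi = stump
    return lo if x[f] < t else hi

rng = random.Random(0)
n, n_trees = 20, 100
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
y = [int(x[0] + x[1] > 0) for x in X]        # class depends on a combination

oob_votes = [[] for _ in range(n)]
prox = [[0] * n for _ in range(n)]
for _ in range(n_trees):
    boot = [rng.randrange(n) for _ in range(n)]       # bootstrap sample
    stump = fit_stump([X[i] for i in boot], [y[i] for i in boot])
    leaves = [stump_predict(stump, x) for x in X]     # "leaf" of each case
    for i in range(n):
        if i not in boot:                             # case i is out-of-bag here
            oob_votes[i].append(leaves[i])
        for j in range(n):
            prox[i][j] += leaves[i] == leaves[j]

# OOB error: each case is predicted only by trees that never saw it.
oob_pred = [max(set(v), key=v.count) for v in oob_votes]
oob_error = sum(p != yi for p, yi in zip(oob_pred, y)) / n
# Proximity: fraction of trees in which two cases share a leaf.
prox = [[c / n_trees for c in row] for row in prox]
```

The OOB error gives an internal estimate of test error without a held-out set, and the proximity matrix is the symmetric, case-by-case similarity that the interactive plots below build on.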
Structuring data and constructing plots to explore forest classification models interactively.
We propose a method to explore and diagnose ensemble classifiers based on three levels of analysis: individual cases, individual models in the ensemble, and comparisons between models.
A key part of the approach is the use of interactive visualization methods.
Interactive web-based visualization of ensemble methods.
There are two key goals that an interactive graphic should accomplish:
Why should we use interactive visualizations?
To see connections within each level that cannot be seen in static graphs.
Links at the case level allow us to identify cases where the model is not working properly and to characterize those cases based on the original data.
We can also identify individual models in the ensemble that are not good enough, and investigate why this happens.
The last level of analysis focuses on model comparison based on predictive performance by class.
159 fish of 7 species (Bream, Parkki, Perch, Pike, Roach, Smelt, and Whitefish) were caught and measured on 6 variables.
| Variable | Description |
|---|---|
| weight | Weight of the fish (in grams) |
| length1 | Length from the nose to the beginning of the tail (in cm) |
| length2 | Length from the nose to the notch of the tail (in cm) |
| length3 | Length from the nose to the end of the tail (in cm) |
| height | Maximal height as % of Length3 |
| width | Maximal width as % of Length3 |
Having better tools to open up black-box models provides a better understanding of the data, the model's strengths and weaknesses, and how the model will perform on future data.
This visualization app provides a selection of interactive plots to diagnose PPforest (PPF) models.
This shell could be used to build an app for other ensemble classifiers.
By combining shiny, ggplot2, and plotly, we can develop informative interactive visualizations.
Amit, Yali, and Donald Geman. 1997. “Shape Quantization and Recognition with Randomized Trees.” Neural Computation 9 (7). MIT Press:1545–88.
Breiman, Leo. 1996. “Bagging Predictors.” Machine Learning 24 (2). Springer:123–40.
———. 2001. “Random Forests.” Machine Learning 45 (1). Springer:5–32.
Breiman, Leo, and others. 1996. “Heuristics of Instability and Stabilization in Model Selection.” The Annals of Statistics 24 (6). Institute of Mathematical Statistics:2350–83.
Ho, Tin Kam. 1998. “The Random Subspace Method for Constructing Decision Forests.” IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8). IEEE:832–44.
Lee, Eun-Kyung, Dianne Cook, Sigbert Klinke, and Thomas Lumley. 2005. “Projection Pursuit for Exploratory Supervised Classification.” Journal of Computational and Graphical Statistics 14 (4).
Schapire, Robert E. 1990. “The Strength of Weak Learnability.” Machine Learning 5 (2). Springer:197–227.